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Abstract 

Inferring gene regulatory networks (GRNs) is a major issue in systems biology, which explicitly characterizes regulatory 
processes in the cell. The Path Consistency Algorithm based on Conditional Mutual Information (PCA-CMI) is a well-known 
method in this field. In this study, we introduce a new algorithm (IPCA-CMI) and apply it to a number of gene expression 
data sets in order to evaluate the accuracy of the algorithm to infer GRNs. The IPCA-CMI can be categorized as a hybrid 
method, using the PCA-CMI and Hill-Climbing algorithm (based on MIT score). The conditional dependence between 
variables is determined by the conditional mutual information test which can take into account both linear and nonlinear 
genes relations. IPCA-CMI uses a score and search method and defines a selected set of variables which is adjacent to one of 
X or Y. This set is used to determine the dependency between X and V. This method is compared with the method of 
evaluating dependency by PCA-CMI in which the set of variables adjacent to both X and V, is selected. The merits of the 
IPCA-CMI are evaluated by applying this algorithm to the DREAM3 Challenge data sets with n variables and n samples 
(n= 10,50,100) and to experimental data from Escherichia coil containing 9 variables and 9 samples. Results indicate that 
applying the IPCA-CMI improves the precision of learning the structure of the GRNs in comparison with that of the PCA-CMI. 
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Introduction 

Bayesian networks (BNs) provide an efficient and effective 
representation of the joint probability distribution of a set of 
variables. The identification of the structure of a BN from the data 
is known to be an NP-hard problem [1]. There are many learning 
algorithms for automatically building a BN from a data set. These 
are generally classified into three classes, namely constraint-based 
methods [2-5], score and search methods [6-11] and hybrid 
methods [12-14]. 

Gene Regulatory Networks (GRNs) explain how cells control 
the expression of genes. GRN is a collection of DNA segments in a 
cell. These segments interact indirectly with each other and with 
other substances in the cell and thereby governing the rates at 
which genes in the network are transcribed into Messenger RNA. 
Modeling the causal interactions between genes is an important 
and difficult task, and indeed, there are many heuristic methods 
for inferring GRNs from gene expression data [15,16]. BN is one 
of the popular methods which have been successfully implemented 
in learning GRNs [17]. 

There is a great potential for improvement of current 
approaches for learning GRNs [18,19]. The purpose of this study 
is to introduce a new algorithm, "Improved Path Consistency 
Algorithm based on Conditional Mutual Information (IPCA- 
CMI)". The algorithm is applied to a number of gene expression 
data sets in order to evaluate the accuracy of it for inferring GRNs. 



IPCA-CMI is a combination of the PCA-CMI [5] and the Hill 
Climbing(HC) algorithm (based on mutual information test (MIT)) 

Being based on conditional mutual information (CMI), IPCA- 
CMI can take into account both linear and nonlinear genes 
relations. This is an improvement over linear testing methods. 
IPCA-CMI applies the HC algorithm (based on MIT score) to 
define weight values for each variable X . Then, a selected set 
which contains variables with weight values more than a defined 
threshold, is created. The method of evaluating dependency 
between two adjacent variables X and Y is represented by CMI 
test given a subset of genes of the selected set. To evaluate the 
accuracy of IPCA-CMI, it was employed to a number of gene 
expression data sets. For this purpose, the Dialogue for Reverse 
Engineering Assessments and Methods (DREAM) program was 
first introduced as a new efficient computation methods that help 
researchers to infer reliable GRNs [18]. The data sets comprised 
DREAM3 Challenge with n variables and n samples 
(«= 10,50,100) and Escherichia coil gene expression data containing 
9 variables and 9 samples. 

Preliminaries 

Bayesian network. Bayesian networks (BNs) [20,21], also 
known as belief networks, belong to the family of probabilistic 
graphical models. Each vertex in the graph represents a random 
variable and the edges between the vertices represent probabilistic 
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dependencies among the corresponding random variables. A 
directed edge, Y-*X, describes a parent and child relation in 
which X is the child and Y is the parent of X. Let ADJ(X) 
denotes the set of variables in the graph which are adjacent to X. 
In addition, each vertex in graph has a conditional probability 
distribution specifying the probability of possible state of the 
variable given possible combination of states of its parents. These 
conditional dependencies in the graph are often estimated by using 
known statistical and computational methods. Hence, BNs 
combine principles from Graph Theory, Probability Theory, 
Computer Science and Statistics. BNs are represented as a 
directed acyclic graph (DAG) that is popular in Statistics and 
Machine Learning subjects. We typically denote random variables 
with capital letters and sets of random variables as bold capital 
letters. Following the above discussion, a more formal definition of 
a BN can be given. A Bayesian network, B, is an annotated 
directed acyclic graph that represents a joint probability distribu- 
tion over a set of random variables X. = {X\,...,X„}. The network 
is defined by a pair B= <(G,$), where G is the DAG with vertex 
set {Xi,X2,...,X n } and the direct dependencies between these 
variables is represented by directed edges. The graph G encodes 
independence assumptions, by which each variable X, is 
independent of its non descendants given its parents in G. Let 
Pag(X) denote the parent set of X. The second component 9 
describes the set of conditional probability distributions. This set 
contains the parameter Q X i\Pti B (x i )= PB( x i\P a B( x i)), where A",- 
denotes some value of the Xj and Pag(Xj) indicates some set of 
values for Xfs parents. If Xj has no parent, then P(Xj\Pas(Xj)) is 
equal to P(Xj). By using these conditional distributions, the joint 
distribution over X can be obtained as follows: 



P(X U X 2 ,...,X„)= I\^P(Xj\Pa B (Xj)). 



Definition 1. If P(X,Y\Z) = P(X\Z)P(Y\Z), then two vari- 
ables X and Y are conditionally independent given Z. 

Definition 2. A path p from X to Y in G is said to be blocked 
by a set of variables Z if and only if: 

1. p contains a chain X— >K— >Y or a fork X<— K— >Y such that 
KeZ, or 

2. p contains a collider X— >K<— Y such that K and all the 
descendants of K are not in Z. 

Definition 3. A set Z is said to d-separate X from Y in G if 
and only if Z blocks every path from X to Y. 

Definition 4. A v-structure in G is an ordered triplet (X, Y,K) 
such that G contains the directed edges X->Y and K-»Y, so that 
X and K are not adjacent in G. 

For the following discussion, suppose that the set of parents of 
Xi is {Xj\,...,Xj Sl }, where Sj denotes the number of parents of 
X i (\Pa(X i )\=s i ). The BN deals with: 

• Discrete variables i.e. the variable Xj and its parents take 
discrete values from a finite set. Then, P{Xi\Xn,...,Xj s .} is 
represented by a table that specifies the probability of values 
for Xj for each joint assignment to {Xj\,...,Xj Sl }. 

• Continuous variables i.e. the variable Xj and its parents take 
real values. In this case, there is no way to represent all possible 
densities. A natural choice for multivariate continuous 
distributions is the use of Gaussian distributions [15]. 

• Hybrid networks i.e. the network contains a mixture of discrete 
and continuous variables. 



Information Theory. Gene expression data are typically 
modeled as continuous variables. The following steps are applied 
to calculate mutual information (MI) and CMI for continuous 
variables. MI has been widely used to infer GRNs because it 
provides a natural generalization of association due to its 
capability of characterizing nonlinear dependency [22]. Further- 
more, MI is able to deal with thousands of genes in the presence of 
a limited number of samples [23]. 

Entropy function is a suitable tool for measuring the average 
uncertainty of a variable X. Let X be a continuous random 
variable with probability density function f(x), the entropy for X 



H(X)-- 



f(x) logf(x)dx. 



(1) 



The joint entropy for two continuous variables X and Y with joint 
density function f(x,y) is: 



H(X, Y)-- 



f(x,y) logf(x,y) dx dy. 



(2) 



The measure of MI indicates the dependency between two 
continuous variables X and Y , which is defined as: 



MI(X,Y)-- 



Kx,y) log dxdy. 

s f(x)f(y) 



(3) 



Variables X and Y are independent when MI has zero value. The 
measure of MI can also be determined in terms of entropy as 
follows: 



MI(X, Y) = H{X) +H{Y)- H(X, Y). 



(4) 



In the GRN the dependency of two genes needs to be determined. 
CMI is a suitable tool for detecting the joint conditional linear and 
nonlinear dependency between genes [5,24]. CMI between two 
variables X and Y, given the vector of variables Z is: 



CMI(X,Y\Z)~- 



f{x,y,z) log > dx dy dz, (5) 



where p is the dimension of vector Z and f(x,y,z) denotes the joint 
density function for variables and f(x\z) is the conditional density 
distribution of X given Z. CMI between X and Y given Z can 
also be expressed by: 



CMI(X, Y\Z) = H(X,Z) + H( Y,Z) -H(Z)- H(X, Y,Z), 



(6) 



where H(X, Y,Z) denotes the joint entropy between X, Y and Z. 

Theorem 1 [25]: Let X = (X u ...,X n ) T be an n-dimensional 
Gaussian vector with mean fi = {p,i,...,pL n ) T and covariance matrix 
C(X) = E(X - ft )(X - fi) T , i.e. X ~ N(jt , C(X)). The entropy of X 



H(X) = log[(2n ef 2 det(C(X)) 1/2 \ = *fog[(2n efdet(C(K))], (7) 

where det(C(X)) indicates the determinant of C(X). With the 
widely adopted hypothesis of Gaussian distribution for gene 
expression data, the measure of MI according to Eqs. 4 and 7 for 
two continuous variables X and Y can be easily calculated using 
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the following equivalent formula [5,26]: 



MI(X,Y)=\log a 

1 GXY 



2 2 



(8) 



where a x , o\ and Gxy indicate the variance of X, the variance of 
Y and the covariance between X and Y. Similarly, according to 
Eqs. 6 and 7, GMI for continuous variables X and Y given Z can 
be determined by [5]: 



CMI{X, Y\Z) - deKC (Z)).det(C(X,Y,Z)y 

in which C(X, Y,Z) denotes the covariance matrix of variables X, 
Y and Z. When X and Y are conditionally independent given Z, 
then CMI(X, Y\Z) = 0. In order to test whether a GMI is zero, 
Z— statistic is calculated in two steps [5,27,28]: 

In step 1, the MI and CMI, respectively, are normalized as 
follows: 



MI{X,Y) = 



CMI(X,Y\Z) 



MI(X, Y) 
H(X) + H(Y)' 

CMI(X,Y\Z) 



(10) 



H(X,Z) + H(Y,Z) 



In step 2, the Z — statistic of MI and GMI, respectively, are 
calculated by: 



Z— statistic x y~- 



Z— statistic x,y\z '- 



1 1 + 

2 Sy l-MI(X,Y) 

1, \+CMI(X,Y\Z) 

~ log( ~ ■ — 

2 l-CMI(X,Y\Z) 



(11) 



methods are applied to discretize continuous gene expression 

data [30-32]. EWD method for .v-th gene divides the line 

between min[Jf,( : )] and max[Jf,( : )] into k intervals of 

equal width. Thus the intervals of gene s have 

... max[X.,(:)]-mm[X s (:)] 
width, w = , with cut points at 

min[Z s ( : )] +w,mm[X s ( : )] +2w,...,min[X s ( : )]+{k- l)w. In 
EWD, k is a positive integer and is a user predefined parameter. 

EFD method for s-th gene divides the sorted X s ( : ) into m 
intervals so that each interval contains approximately the same 
number of expression values. Similarly, in EFD, m is a positive 
integer and is a user predefined parameter. 

In this study, gene expression data sets related to DREAM3 
Challenge lie in the interval [0, 1]. We applied EWD method to 
discretize DREAM3 data sets. For instance, for each gene, 
parameter k is considered to be equal to 10. EFD method is 
applied to discretize SOS repair data. Gene expression data sets 
related to SOS DNA repair network lie in the interval [—0.2730, 
26.6330] and the parameter m is considered to be equal to 9. 

Scoring Function 

There are many scoring functions to measure the degree of 
fitness of a DAG G to a data set. These are generally classified as 
Bayesian scoring functions [7,9,33] and information theory-based 
scores [1 1,34—37]. The chosen score and search algorithm can be 
more efficient if the scoring function has the decomposability 
property. 

Decomposability property: A scoring function g is decom- 
posable if: 



g{G : D)=Y,gD{X i ,Pa B {X i )), 



(12) 



where 



In order to determine the statistical test of conditional indepen- 
dence, a confidence level a is fixed. When 

— |Z| — 3|Z— statistic] <<b~' (I — -) then, the hypothesis of 

conditional independence of X and Y given Z is accepted (at the 
significance level a); otherwise the hypothesis is rejected. Here O(-) 
denotes the cumulative distribution function of the standard 
normal distribution and |Z| indicates the dimension of vector Z. 

Score and Search Algorithms 

Score and search algorithms can be completely described by 
specifying two components: a scoring function and a search 
procedure. The score and search algorithms try to identify a 
network with maximum score. 

In this study, we apply the HC algorithm as a search procedure 
where MIT score is used as a scoring function. Gene expression 
data are typically continuous variables. The MIT score deals with 
discrete variables. Therefore, continuous variables have to be 
discretized. We do this based on the procedure proposed by [29- 
32]. 

Discretization methods. To draw inferences about a GRN 
based on the set of genes X = {Xi,...,X„}, we start with a data set 
D = {X l (l),...,X n {l)]},...,{X l (N),...,X„(N)}, where n indicates 
the number of genes and TV is the number of measurements of 
these genes. An n by N matrix D is used to denote gene expression 
data. X s (i) indicates the expression value of gene s at time t. X s ( : ) 
denotes expression data of gene s at all time. The equal width 
discretization (EWD) and equal frequency discretization (EFD) 



g D (X i ,Pa B (X i )) = g D (X i ,Pa B (X i ) : Nx^a^)), (13) 

and Nx b Pa B (X{) denotes the number of instances in data set D that 
match with each possible configuration of {Xf\ [j{Pa B (Xi)}. 

Another property, which is particularly interesting if the score 
and search algorithm searches in a space of equivalence classes of 
DAGs, is called the score equivalence. 

Theorem 2 [38]. Two DAGs are equivalent if and only if they 
have the same skeletons and the same v-structures. 

When two Bayesian networks are equivalent, they can represent 
the same set of probability distributions. The relation of network 
equivalence imposes a set of equivalence classes over Bayesian 
network structures [39]. 

Score equivalence: A scoring function g is score equivalence 
if the score assigns the same value to equivalent structures. 

MIT Score. Mutual information test (MIT) belongs to the 
family of information theory-based scores which is defined as 
follows [11]: 



gMir(G,D) = 



J2 {INMUXifaBiXi)) - max £ XxJ . a ), 



(14) 



where N denotes the total number of measurements in D and 
MI D (Xi,Pa B (Xi)) is determined by: 
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Table 1. The PC Algorithm based on CMI test (PCA-CMI) [5]. 



1: Start with a complete undirected graph S-i 


2: 


i=0 


3: 


Repeat 


4: 


For each XeX 


5: 


For each YeADJ(X) 


6: 


Determine if there is V X y with |M| = / such that X and Y given M are independent 


7: 


If this set exists 


8: 


Remove the edge between X and V from 


9: <=i+l 



Until i<\V X y\ 



doi:1 0.1 371 /journal.pone.0092600.t001 



MI D {X u Pa B {X^ = -<r J2 E ** fo *< 



i=l /=! fc=l 



), (15) 



where Nyk represent the number of measurements in the data set 
D for which Xi = k and Pag(Xi)=j, where j denotes a joint 
configuration of all parent variables of Xj. Ny denotes the number 
of measurements in D, in which Pas(Xi)=j. Similarly, Af«t 
indicates the number of measurements in D which the variable 
Xj = k and It^-fj) is defined by: 



i0) = 



(n-l)(r I(ri (i)-i) 7=1 



(16) 



where (7j = [ffj(l),...,(7;(,Sj)] indicates any permutation of the index 
set of the variables in Pa£(Xi) = {Xa,...,Xj s .}. Finally, 

Xa,/, >i(/) is the value such that P (X l (k<'i<j))^X«,l it(! . (j) ) = <X (Ae Chi- 
squared distribution at significance level 1 — a with lt,a-Q) degrees 
of freedom). 

The MIT score has decomposability property and dose not 
satisfy score equivalence, however, it satisfies less demanding 
property. This property of the MIT score concerns another type of 
space of equivalent of DAGs, namely restricted acyclic partially 
directed graphs (RPDAGs) [40]. RPDAGs are partially directed 
acyclic graphs (PDAGs) which represent sets of equivalent DAGs, 
although they are not a canonical representation of equivalence 
classes of DAGs (two different RPDAGs may correspond to the 
same equivalence class). 

Theorem 3 [11]. The MIT score assigns the same value to all 
DAGs that are represented by the same RPDAG. 

MIT score can be applied without any problem to search in 
both the DAG and the RPDAG spaces [1 1]. In different studies, 
the score equivalence could be concluded as a good or bad 
property. The score equivalence property is appropriate when the 
data are not applied to distinguish the equivalent structures. In 
searching and scoring scheme for learning structure of Bayesian 
networks, equivalent classes should be considered. This means 
when more than two graphs are equivalent, those graphs have the 
same dependency; therefore, two structures have identical scores. 
As an example, two variables A and B may have two different 
structures as A^B or A*-B, however, as equivalent classes, these 
two structures end up having the same score for any given data. In 
order to detect causal relationships between genes, score 



equivalence property does not necessarily impair the search 
process, because equivalent structures represent different causal 
relationships. In this study, we are interested to the scoring 
functions which considered different scores for A-^B or A<—B. So, 
MIT score is applied in the HC algorithm to compute the score of 
DAGs. The non equivalence of the score function does not 
necessarily impair the search process to learn BNs. The MIT score 
is implemented within the Elvira system (a JAVA package for 
learning the structure of BN [11]). The Elvira package can be 
downloaded from http://leo.ugr.es/elvira/. The MIT score is 
available at /bayelviralj 'elviraj 'learning / MIT Metrics. java. In 
this study, we rewrite the MIT score program (Red.Pen) which, in 
comparison to the Elvira system, reduces running time and 
memory occupied by the algorithm. The source of the program 
and data sets are available at http://www.bioinf.cs.ipm.ir/ 
software/IPCA-CMI/. 

Search Procedure 

Given a scoring function g, the task in this step relates to search 
between possible networks to find G* such that: 



G* = argmaxGeF(n) g(G : D), 



(17) 



in which g( G : D) denotes the degree of fitness of candidate G to 
data set and F(n) indicates all the possible DAGs defined on X. 
The challenging part of search procedure is that the size of the 
space of all structures, /(«), is super-exponential in the number of 
variables [41], 



Table 2. Zero order of the Improvement of PC Algorithm 
based on CMI test. 



Start with a complete undirected graph 
Repeat 

For each A'eX 

For each YeADJ(X) 

If X and Y are independent based on the measure of Ml 
Remove the edge between X and Y from S-i 
The MIT score was utilized in the HC algorithm to construct G 0 . 



doi:1 0.1 371 /joumal.pone.0092600.t002 
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Table 3. / order (/>0) of the Improvement of PC Algorithm based on CMI test. 



1 : Start with G 0 

2: !=1 

3: Repeat 

4: For each XeX 

5: For each YeADJ(X) 

6: Test whether 3H c= R XY with |H| = i such that X and V given H are independent. 

7: If this set exists 

8: Remove the edge between X and Y from G;_i 

9: The MIT score was utilized in the HC algorithm to direct the structure. 

1 0: For each Ze{ADJ(X)\JADJ(Y)}\{X, Y} 

11: The weight value for variable Z is determined by: Weight x {Z) = \A\z\ + \A2z\ + \Aiz\ — \A4z\ 

12: A selected set R XY of variables is created as: R XY = {Z\Weight x (Z)>k or Weight Y (Z)>k,1Ze{ADJ(X)\JADJ{Y)}\{X,Y)) 

13: i = i+1 

14: Until i<\R XY \ 

doi:1 0.1 371 /joumal.pone.0092600.t003 



Rri)=± (-I)'" 1 ^y2**- */(»-0. (18) 

So an exhaustive enumeration of all the structures is not possible. 
Instead, researchers have considered heuristic search strategies 
[9,42] . The Hill Climbing algorithm is particularly popular in this 
field. 

The Hill Climbing Algorithm. The Hill Climbing (HC) 
algorithm is a mathematical optimization technique which belongs 
to the family of local search. The HC algorithm traverses the 
search space by starting from an initial DAG then, an iterative 
procedure is repeated. At each procedure, only local changes such 
as adding, deleting or reversing an edge are considered and the 
greatest improvement of g is chosen. The algorithm stops when 
there is no local change yielding an improvement in g. 

Because of this greedy behavior the execution stops when the 
algorithm is trapped in a solution that is mostly local rather than 
global maximizer of g. Different methods are introduced to escape 
from local optima such as restarting the search process with 
different initial DAGs. It means that after a local optima is found 
the search is reinitialized with a random structure. This 
reinitialization is then repeated for a fixed number of iterations, 
and the best structure is selected [20]. The local search methods 
can be more efficient if the scoring function has the decompos- 
ability property [11]. By considering the decomposability proper- 
ty, by adding, deleting or reversing the edge between two 
variables, the score values of this variables are updated while the 
score values of other variables remain unchanged. In order to 
apply the HC algorithm based on scoring function with the 
decomposability property, the following differences are calculated 
to evaluate the improvement obtained by local change in a DAG 
[43]: 

1. Addition oiXj^Xr. gDiXuPaBiX^iXj^-goiXuPaBiXi)) 

2. Deletion of Xj^X C . gniX^a^XdsiXj^-g^X^a^Xd) 

3. Reversal of Xj-*Xi\ First the edge from Xj to X t is deleted 
then, a edge from Xj to Xj is added. So [g£>{Xj,PciB 
(XdMXj}) -g D (X h Pa B {X$)] + [g D (Xj,Pa B (Xj)} U{*i})-ffl> 
(Xj,Pa B (Xj)] is computed. 



Method 

In this section the details of PCA-CMI and IPCA-CMI are 
presented to show how the structure of GRN is learned from gene 
expression data sets. 

PC Algorithm based on CMI test (PCA-CMI) 

The PCA-CMI is applied to infer the GRNs [5]. The PCA- 
CMI is computationally feasible and often runs very fast on 
networks with many variables. This algorithm starts with a 
complete undirected graph over all variables. The following steps 
are applied to assign skeleton Si from 

Step 1: Generate the complete undirected graph 5, (/= — 1). 

Step 2: Set i=i+l. Suppose X and Y are adjacent in 
then Vxy is defined by: 

V XY = {ADJ(X)f]ADJ(Y)}. 

Suppose that, there are j number of genes in Vxy {\Vxy\=J}' If 
i<j, for each f-subset of Vxy such as M = {wi,...,W,}, the z'-order 
CMI(X,Y\M) is computed according to Eq. 9. All the z'-order 
CMIs between X and Y given all possible combination of i genes 
from_/ genes are computed and the maximum one was selected as 
CMI max (X,Y\M). If CMI max (X,Y\M)<e, the edge between X 
and Tis removed from So, Vxy includes the separator set 

for X and Y. The algorithm is stopped when i > j. Let S, be the 
skeleton of the constructed graph in this step and return to step 2. 

The algorithm is stopped when S,_ i = S, for the first i. 

It is notable that in each step of the PCA-CMI, XeX is selected 
from 1 to n and YeADJ(X) is selected by the order of genes. 
Details of PCA-CMI are given in table 1 . 

The Improvement of PC Algorithm based on CMI test 
(IPCA-CMI) 

The HC algorithm is the well-known approach to search 
between possible DAGs to determine the best fit of network based 
on defined scoring function. In addition, Zhang [5] have 
implemented the PCA-CMI for inferring GRNs from gene 
expression data, using CMI test in the process of dependency 
determination between genes. The skeleton of a GRN in each 
order of IPCA-CMI is determined by CMI test. Therefore, only 
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True Network F * rst Order Network First Order Network 

Inferred by PCA-CMI Inferred by IPCA-CMI 

Figure 1. Comparing the result of the PCA-CMI and the IPCA-CMI for inferring the structure of DREAM3 contains 10 variables and 
1 0 edges. (A) The true network with 1 0 variables and 1 0 edges. (B) Firs-order network inferred by the PCA-CMI. The edge with red line G2-G4 is false 
positives, while the edges G1-G2, G3-G5 and G4-G9 are false negative. (C) First-order network obtained by the IPCA-CMI. The false positive edge G2- 
G4 in (B) is successfully removed by the IPCA-CMI, in addition edges G1-G2 and G3-G5 are successfully found by this algorithm. 
doi:1 0.1 371 /journal.pone.0092600.g001 



the local changes related to reversed edges between genes are 
applied in the HC algorithm (step 3 of the algorithm) in order to 
direct the edges of the skeleton. 

When heuristic search algorithms are applied, we are not 
guaranteed to find a global optima structure. Different methods 
have been proposed to escape local optima. 

In this study, during each iteration in the HC algorithm, a new 
solution is selected from the neighborhood of the current solution 
(random change including adding, deleting and reversing). If that 
new solution has a better quality MIT score, then the new solution 
becomes the current solution. The algorithm stops if no further 
improvement are possible. We have to start with some (50 
randomly generated) solution and evaluate it based on MIT score. 
The HC algorithm can only provide locally optima that depends 
on the starting solution. We have to start the HC algorithm from a 
large variety of different solutions. The hope is that at least some of 
these initial locations have a path that leads to the global optima. 
We choose the initial solutions (50 DAGs) at random. 

Details of the IPCA-CMI are presented in two parts. Part 1 is related 
to the zero order of the IPCA-CMI. In this order, same skeletons are 
obtained by PCA-CMI and IPCA-CMI, but the HC algorithm is 
utilized in IPCA-CMI in order to direct the edges of the skeleton. 
Details of the IPCA-CMI for order i {i>0) are presented in part 2. 

Part 1: The details of IPCA-CMI for ! = 0. First, the IPCA- 
CMI generates complete graph according to the number of genes. 
Then, for each adjacent gene pair such as Xand T, the measure of 

Table 4. The result of Simulated and Real data sets in order 0. 



MI is computed according to equation [8] . The measures of MI 
between X and Y are calculated to be compared with t. If 
MI(X, Y) < e, the edge between X and Y\s removed from complete 
graph. Finally, MIT score is applied in the HC algorithm in order to 
direct the edges of skeleton to obtain the directed acyclic graph Go- 
Details of zero order of IPCA-CMI are shown in table 2. 

Part 2: The IPCA-CMI for ;>0. Set 1 = 0 and the following 
process is applied to assign directed acyclic graph G, from G,_i: 

Step 1: Set i=i+l. Let Z be an adjacent of X in G;_i. Then, 
A qz for 1 <q<4 are defined as follows: 

A 1Z = {W\X^Z^W}, A 2Z = {W\X^Z^W}, 
A iz = { W\X^Z^ W}, A 4Z = { W\X^Z^ W}. 

The weight value for variable Z ^ s determined by: 

Weight x (Z) = \A 1Z \ + \A 2Z \ + \A 3Z \ - \A 4Z \, 

where \A qZ \ for \ <q<A denotes the size of A qZ . 
Step 2: Let Rxy be defined by: 

R XY = {Z\Weight x (Z)>k or Weight Y (Z)>k, 
\IZe{ADJ{X){JADJ( Y)}\{X, Y}}, 





Network 


TP 


FP 


ACC 


FPR 


FDR 


PPV 


F 


MCC 


TPR 


DREAM10 


9 


1 


0.95 


0.02 


0.10 


0.9 


0.90 


0.87 


0.90 


DREAM50 


36 


54 


0.92 


0.05 


0.6 


0.4 


0.43 


0.39 


0.46 


DREAM100 


70 


58 


0.96 


0.01 


0.45 


0.55 


0.47 


0.46 


0.42 


SOS 


18 


4 


0.72 


0.33 


0.18 


0.82 


0.78 


0.40 


0.75 



The second row of the table shows the result of DREAM3 in size of 1 0 with threshold 0.05. The third row denotes the result of DREAM3 in size of 50 with threshold 0.1 . 
The forth row of the table indicates the result of DREAM3 in size of 100 with threshold 0.1. Finally the last row shows the result of SOS DNA repair network with 
threshold 0.01. 

doi:1 0.1 371 /journal.pone.0092600.t004 
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Table 5. The result of gene expression data set DREAM3 Challenge with 10 genes and sample number 10. 



Algorithm 


TP 


FP 


ACC 


FPR 


FDR 


PPV 


F 


MCC 


TPR 


PCA1 


7 


1 


0.91 


0.03 


0.13 


0.87 


0.78 


0.73 


0.7 


IPCA1 


8.8 


0 


0.98 


0 


0 


1 


0.94 


0.93 


0.88 



Result of DREAM3 in size of 10 with first-order CMI test with threshold 0.05. The second row of the table indicates the result of first-order PCA-CMI (PCA1) the third row 
of the table shows the result of first-order IPCA-CMI (IPCA1). 
doi:1 0.1 371 /journal.pone.0092600.t005 



where k denotes the median of weights related to all adjacent 
variables of X or Y. It can be concluded that variables in set Rxy 
are selected from {ADJ(X)\JADJ( Y)}\{X, Y} in which at least 
k number of paths started from X or Y are blocked by these 
variables. Therefore, by considering these variables many paths 
between X and Y are removed. 

Step 3: Let X and Y be adjacent in G,_i, we have done the 
following process: 

Suppose that, there are t genes in Rxy (\Rxy\ = 0- If i<t, for 
each i-subset of Rxy such as H = {Ai, ...,/;;}, the z'-order 
CMI(X,Y |H) is computed according to equation [9]. All the i- 
order CMIs between X and Y given all possible combination of t 
genes from t genes are computed and the maximum result 
{CMI max {X, F|H)) is compared with e. If CMI max {X, F|H)<£, 
the edge between X and Y is removed from G,_i. The algorithm 
is stopped when i > t. Let S, be the skeleton of the constructed 
graph in this step. 

Step 4: MIT score is applied in the HC algorithm in order to 
direct the edges of «S; to obtain the directed acyclic graph G,, 
return to step 1. 

The algorithm is stopped when G,_i = G, for the first z'Table 3 
is related to the details of i order (i>0) of IPCA-CMI. 

It is notable that in each step of the IPCA-CMI, XeX is selected 
from 1 to n and YeADJ(X) is selected by the order of genes. 

The rational behind Weightx(Z) is in definitions 2 and 3. 
Weightx(Z) indicates the number of paths started from X and 
blocked by Z. 

In fact the main difference between the IPCA-CMI and the 
PCA-CMI is in choosing a selected set of variables which includes 
the separator set. IPCA-CMI uses the HC algorithm and define a 
selected set of variables which are adjacent to one of X or Y, with 
weight values more than a defined threshold. 

Software 

Software in the form of MATLAB and JAVA codes. The source 
of data sets and codes are available at http:/ /www.bioinf.cs.ipm. 
ir/software/IPCA-CMI/. 



Results 

In order to validate our algorithm, the true positive (TP), false 
positive (FP), true negative (TN) and false negative (FN) values for 
proposed algorithms are computed. Where TP is the number of 
edges that are correctly identified, FP is the number of edges that 
are incorrectly identified, TN is the number of edges that are 
correctiy unidentified and FN is the number of edges that are 
incorrectly unidentified. In addition, some famous measures such 
as the accuracy (ACC), false positive rate (FPR), false discovery 
rate (FDR), positive predictive value (PPV), F-score measure, 
Matthews correlation coefficient (MCC) and true positive rate 
(TPR) are considered to compare algorithms, more precisely. 
These measures are defined by: 



ACC-- 



TP+TN 



FPR- 



FP 



FDR- 



TP+FP+TN+FN 



FP+TN 



F = 2 



FP+TP 
PPV x TPR 
'PPV+TPR , 



MCC- 



TPR- 



TP + FP 



TP x TN — FP x FN 



(19) 



(TP + FP)(TP + FN)(TN + FP)(TN + FN) , 
TP 



TP + FN 



MCC is a convenient quantity for comparing predicted and actual 
networks. MCC quantity for each algorithm indicates which 
method is more efficient in predicting networks. The algorithm 
which has higher values for measures TP, TN, ACC, PPV, F, 
MCC and TPR is more efficient for predicting the skeleton of 
networks. 

The DREAM3 Challenge consists of 4 data sets that were 
produced from in-silico networks. The goal of the in-silico 
Challenge is the reverse engineering of gene networks from time 
series data. The gold standard for DREAM3 Challenge were 
determined from source networks of real species. In this study, we 
tested the performance of IPCA-CMI on the DREAM3 data sets 



Table 6. The result of gene expression data set DREAM3 Challenge with 50 genes and sample number 50. 



Algorithm TP FP ACC FPR FDR PPV F MCC TPR 



PCA1 


24 


23 


0.93 


0.02 


0.49 


0.51 


0.39 


0.37 


0.31 


PCA2 


22 


21 


0.93 


0.02 


0.49 


0.51 


0.37 


0.35 


0.29 


IPCA1 


28 


26.5 


0.94 


0.02 


0.48 


0.51 


0.43 


0.4 


0.36 


IPCA2 


22.9 


11.78 


0.95 


0.01 


0.48 


0.52 


0.38 


0.42 


0.3 



Result of DREAM3 in size of 50 with different CMI orders with threshold 0.1 . The second and third rows of the table indicate the result of first-order PCA-CMI (PCA1 ) and 
second-order PCA-CMI (PCA2), respectively. The forth and fifth rows of the table show the result of IPCA-CMI of first-order(IPCAI) and second-order(IPCA2), respectively. 
doi:1 0.1 371 /journal.pone.0092600.t006 
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Table 7. The result of gene expression data set DREAM3 Challenge with 100 genes and sample number 100. 



Algorithm 


TP 


FP 


ACC 


FPR 


FDR 


PPV 


F 


MCC 


TPR 


PCA1 


49 


25 


0.971 


0.005 


0.34 


0.66 


0.41 


0.43 


0.28 


PCA2 


46 


25 


0.971 


0.005 


0.35 


0.64 


0.38 


0.41 


0.27 


IPCA1 


53.11 


29.77 


0.972 


0.006 


0.35 


0.65 


0.43 


0.44 


0.32 


IPCA2 


46.55 


15.16 


0.973 


0.003 


0.24 


0.75 


0.4 


0.45 


0.28 



Result of DREAM3 in size of 1 00 with different CMI orders with threshold 0.1 . The second and third rows of the table indicate the result of first-order PCA-CMI (PCA1 ) and 
second-order PCA-CMI (PCA2), respectively. The forth and fifth rows of the table show the result of IPCA-CMI of first-order (IPCA1) and second-order (IPCA2), 
respectively. 

doi:1 0.1 371 /joumal.pone.0092600.t007 



with n variables and n samples (n = 10,50,100) and to experimental 
data from Escherichia coil containing 9 variables and 9 samples. 
Data sets contain the expression values of genes, in which rows are 
genes and columns indicate the samples. In order to compare 
results of PCA-CMI and IPCA-CMI, in each algorithm, we used 
the same threshold for CMI tests previously applied by Zhang [5], 
IPCA-CMI, is a combination of a constraint-based method 
named PCA-CMI with a score and search method named the HC 
algorithm. Since the HC algorithm includes the process of 
randomly selecting initial graphs, IPCA-CMI is supposed to run 
hundred times and then we take the average as the final result. It 
can be concluded that outcomes of Tables 1 to 5 are related to the 
average of results which obtained from IPCA-CMI in hundred 
times. 

Fig. 1(A) shows the structure of true network for DREAM3 
which contains 10 genes and 10 edges. The result obtained by 
Zhang [5] illustrated in Fig. 1(B), and Fig. 1(C) is related to the 
result of IPCA-CMI. In Fig. 1(B), edges that are correctly found by 
PCA-CMI are shown in Black color and the edge that wrongly 
inferred by this algorithm (edge G2-G4) is shown in red color. The 
true edges, which found by IPCA-CMI, are indicated by Black 
color and edge G4-G9 is a false negative. Fig. 1(C) is related to the 
best result of IPCA-CMI in running it hundred times. 

Table 4 indicates the result of PCA-CMI and IPCA-CMI with 
zero-order CMI test for DREAM3 and SOS real gene expression 
data. In zero-order two algorithms returned the same results, since 
both algorithms contain the same procedure. 

Table 5 indicates the result of PCA-CMI and IPCA-CMI for 
DREAM3 data set in size of 10 genes with 10 edges. We set the 
threshold value 0.05 of MI and CMI tests for dependency 
determination. As shown by Table 5, TP, ACC, PPV, F, MCC 
and TPR under PCA-CMI are less than those of IPCA-CMI. So, 
it can be concluded that the IPCA-CMI is more suitable for 
structure learning. 

Results of applying PCA-CMI and IPCA-CMI for DREAM3 
Challenge with 50 genes and 77 edges are collected in Table 6. We 
chose 0.1 as the threshold value of MI and CMI tests to determine 
the dependency between genes. IPCA-CMI can detect the true 



network in 2 steps, FP value is reduced as a result of applying 
algorithm step by step. According to Table 6 the FP value is 
reduced from 2 1 to 1 1 .78, as a result of using IPCA-CMI. The TP, 
ACC, PPV, F, MCC and TPR measures receive higher values by 
using IPCA-CMI for inferring GRNs which shows that the IPCA- 
CMI performs better than the PCA-CMI. 

Results of DREAM3 with 100 variables and 166 edges are 
illustrated in Table 7. Threshold value 0.1 for MI and CMI tests is 
considered to determine the dependency between genes. As shown 
by Table 7 in the second-order network, the FP value is reduced 
from 25 to 15.16. The TP, ACC, PPV, F, MCC and TPR 
measures receive higher values by using IPCA-CMI for inferring 
about DREAM3 with 100 variables. Results of applying PCA- 
CMI and IPCA-CMI for the real data set with 9 genes and 24 
edges are given in Table 8. We chose 0.01 as the threshold value. 
Table 8 indicates that ACC, F and MCC measures receive higher 
values by using IPCA-CMI for inferring about BNs which shows 
that the IPCA-CMI performs better than the PCA-CMI. 

According to Tables 4 to 8, the number of FP is decreased, as a 
result of using IPCA-CMI. So it can be concluded that the IPCA- 
CMI is more suitable for learning the structure of GRNs. Tables (4 
to 8) show that IPCA-CMI not only can reduce the number of FP 
but also it remarkably can find some true different edges in 
comparison with PCA-CMI. As shown by these Tables (4 to 8), 
some better results can be obtained by using IPCA-CMI. So, it can 
be concluded that IPCA-CMI performs better than the PCA-CMI 
for learning the structure of GRNs. Another comparison that can 
be made between these algorithms is a determination of the 
probability of selecting subgraph with k edges from graph G with m 
edges. These probabilities are calculated for two mentioned 
algorithms. The algorithm which receives smaller value of the 
probability is efficient for predicting the skeleton of GRN. Results 
of this comparison for networks which are obtained using 
DREAM3 and SOS real gene expression data are given in 
Table 9. As shown by Table 9, better results (e.g., smaller 
probability values) are obtained by using IPCA-CMI. Therefore, it 
can be concluded that the performance of IPCA-CMI is much 



Table 8. The result of experimental data from Escherichia coil containing 9 genes and sample number 9. 



Algorithm TP FP ACC FPR FDR PPV F MCC TPR 



PCA1 


18 


4 


0.72 


0.33 


0.18 


0.82 


0.78 


0.40 


0.75 


IPCA1 


18 


1.8 


0.73 


0.32 


0.17 


0.82 


0.79 


0.41 


0.75 



The result of SOS DNA repair network in size of 9 with 24 edges. Results are related to the order 1 of CMI with threshold 0.01 . The second row of the table indicates the 
result of first-order PCA-CMI (PCA1). The third row of the table show the result of IPCA-CMI of first-order (IPCA1). 
doi:1 0.1 371 /journal.pone.0092600.t008 
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Table 9. The probability of occurrence of GRNs. 



Algorithm DREAM 10 DREAM50 DREAM 100 SOS 

PCA 1.948475e-05 6.211307e-17 9.751 598e-53 0.01755 

IPCA 1.12846e-08 1.252285e-21 5.337701 e-59 0.001215584 



A determination of the probability of selecting a subgraph using the PCA-CMI 
and the IPCA-CMI. The second row of the table indicates the result of last order 
PCA-CMI (PCA). The third row of the table show the result of last order IPCA-CMI 
(IPCA). 

doi:1 0.1 371 /journal.pone.0092600.t009 

better than that of PCA-CMI based on the better determination of 
the probability for selecting a subgraph in all data sets. 

Discussion 

In this study a new algorithm called IPCA-CMI for inferring 
GRNs from gene expression data was presented. Results of this 
study show that using IPCA-CMI improves the precision of the 
learning the structure of GRNs, considerably. Zhang [5] reported 
that the PCA-CMI performed better than linear programming 
method [44] , multiple linear regression Lasso method [45] , mutual 
information method [46] and PC-Algorithm based on partial 
correlation coefficient [27] for inferring networks from gene 
expression data such as DREAM3 Challenge and SOS DNA 
repair network. Therefore, it can be concluded that the results of 
IPCA-CMI will be more precise compared to the methods studied 
by Zhang [5] . 

Our algorithm starts with a complete undirected graph over all 
variables. IPCA-CMI constructs 5) (the skeleton of order ij 
according to CMI test. Then perform the HC algorithm to direct 
the edges of Sj. If X and Y are adjacent in S,-, weight values are 
defined for variables in set Qxy = {ADJ(X)(JADJ( Y)}\{X, Y}. 
Subsequently, variables with high weight values were selected as 
the members of the set Rxy- The separator set being a subset of 
Rxy, makes defining the set Rxy in the algorithm very important. 
We adopted a method to select i number of genes from Rxy- 
Suppose that, there are t number of genes in Rxy i\Rxr\ = t). In 
order to construct the z'-order (/</) network, all the z'-order CMIs 
between X and Y given all possible combination of i genes from t 
genes are calculated and the maximum result compared with c 
threshold to decide whether to keep the edge between X and Y or 
to remove it. 

The PC algorithm starts with a complete undirected graph over 
all variables. In order to construct Si, the Chi-square test is applied 
to determine dependency between variables. The separator set for 
adjacent genes X and Y in Si are selected from Qxy- The PC 
algorithm is fast to learn networks with many variables. The 
drawback of the PC algorithm is the requirement for large sample 
sizes to perform high order conditional independence (CI). The 
number of records in a microarray data set is rarely sufficient to 
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